Visual Adjacency Multigraphs – a Novel Approach for a Web Page Classification
Abstract
Standard techniques for web page classification usually take a simple text-based approach, which discards most of the information provided by the visual layout of a page. We propose a new classification approach in which visual layout analysis is performed before standard classification techniques are applied. A page is represented as a hierarchical structure, the Visual Adjacency Multigraph, in which nodes represent simple HTML objects (text, images) and directed edges reflect the spatial relations ‘immediately before’, ‘immediately after’, ‘immediately left’ and ‘immediately right’ on the browser screen. Using the visual information contained in the multigraph, one can define heuristics for recognizing common page entities such as vertical and horizontal link lists, titles and subtitles, and paragraphs of text. Visual analysis yields a more accurate representation of the page contents, in which the text features are split into different subsets according to the groups they belong to. Finally, we introduce a classification system that takes the proposed layout analysis into account and clearly outperforms a standard bag-of-words approach.
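To make the representation more concrete, the following is a minimal sketch of how such a multigraph might be built from rendered bounding boxes, together with one possible heuristic for vertical link lists. It is not the authors' implementation. The `Box` class, the node kinds, the function names, the reading of ‘immediately before’ as "directly above", and the `min_len` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered HTML object (hypothetical model): name, kind, and screen rectangle."""
    name: str
    kind: str  # assumed labels: "text", "image" or "link"
    x: float
    y: float
    w: float
    h: float


def _overlaps(lo1, hi1, lo2, hi2):
    """True if the open intervals (lo1, hi1) and (lo2, hi2) intersect."""
    return min(hi1, hi2) > max(lo1, lo2)


def _nearest(candidates, dist):
    """Return every candidate at the minimal distance (ties are all kept)."""
    if not candidates:
        return []
    best = min(dist(c) for c in candidates)
    return [c for c in candidates if dist(c) == best]


def build_edges(boxes):
    """Return labeled directed edges (source, relation, target).

    Assumed reading: (a, "immediately before", b) means a sits directly
    above b; (a, "immediately left", b) means a sits directly left of b.
    Two nodes may be joined by several edges, hence a multigraph.
    """
    edges = []
    for a in boxes:
        others = [b for b in boxes if b is not a]
        col = [b for b in others if _overlaps(a.x, a.x + a.w, b.x, b.x + b.w)]
        row = [b for b in others if _overlaps(a.y, a.y + a.h, b.y, b.y + b.h)]
        groups = {
            "immediately before": _nearest(
                [b for b in col if b.y >= a.y + a.h], lambda b: b.y - (a.y + a.h)),
            "immediately after": _nearest(
                [b for b in col if b.y + b.h <= a.y], lambda b: a.y - (b.y + b.h)),
            "immediately left": _nearest(
                [b for b in row if b.x >= a.x + a.w], lambda b: b.x - (a.x + a.w)),
            "immediately right": _nearest(
                [b for b in row if b.x + b.w <= a.x], lambda b: a.x - (b.x + b.w)),
        }
        for relation, targets in groups.items():
            edges.extend((a.name, relation, b.name) for b in targets)
    return edges


def vertical_link_lists(boxes, edges, min_len=3):
    """Assumed heuristic: a chain of at least min_len vertically stacked links."""
    kind = {b.name: b.kind for b in boxes}
    below = {s: t for s, rel, t in edges
             if rel == "immediately before" and kind[s] == kind[t] == "link"}
    starts = [n for n in below if n not in below.values()]
    chains = []
    for start in starts:
        chain = [start]
        while chain[-1] in below:
            chain.append(below[chain[-1]])
        if len(chain) >= min_len:
            chains.append(chain)
    return chains


# Toy page: a title, three stacked links, and a body paragraph to their right.
boxes = [
    Box("title", "text", 0, 0, 200, 20),
    Box("home", "link", 0, 30, 50, 10),
    Box("news", "link", 0, 45, 50, 10),
    Box("about", "link", 0, 60, 50, 10),
    Box("body", "text", 60, 30, 140, 100),
]
edges = build_edges(boxes)
link_lists = vertical_link_lists(boxes, edges)
print(link_lists)  # [['home', 'news', 'about']]
```

Once entities such as link lists or titles have been recognized, one way to split the text features by group would be to prefix each token with the entity it came from before building the bag-of-words vector. This is again only one possible reading of the abstract.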
Similar resources
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Edge-Coloring Bipartite Multigraphs to Select Network Paths
We consider the idea of using a centralized controller to schedule network traffic within a datacenter and implement an algorithm that edge-colors bipartite multigraphs to select the paths that packets should take through the network. We implement three different data structures to represent the bipartite graphs: a matrix data structure, an adjacency list data structure, and an adjacency list d...
Expert Discovery: A web mining approach
Expert discovery is the search for an answer to the question: “Who is the best expert on a specific subject in a particular domain within a particular array of parameters?” An expert with domain knowledge in any field is crucial for consulting in industry, academia and the scientific community. The aim of this study is to address the issues of the expert-finding task in a real-world community. Collabor...
Content Based Web Sampling
Web characterization methods have been studied for many years. Most of these methods focus on text-based web contents. Some of them analyze the contents of a web page by analyzing its HTML code, hyperlinks, and/or DOM structure. A web page is seldom characterized based on its visual appearance. A good reason for also considering the visual appearance of a web page is that humans initially...
Publication date: 2004